US National Parks and Gas Prices

Having traveled to many national parks, we wanted to learn more about the national sites spanning the United States and how visitor numbers have changed over time. Because national parks in the US are primarily accessible by road, this analysis combines visitor data from the National Park Service with a state population dataset (US Census) and gas price data (US Dept. of Commerce). Our ultimate goal was to investigate whether visitor numbers changed with gas prices. The report begins with an exploration of visitor trends, continues with an investigation of gas price fluctuations, and ends with a synthesis of the two factors that accounts for population growth. The scope of the analysis was limited to the years common to the combined data; most of the data manipulation involved dataset merges.

*I worked on this notebook with a classmate; this was our final project for a class.

Data Setup

In [45]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

plt.rcParams.update({'font.size': 13})
In [46]:
parks = pd.read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-09-17/national_parks.csv")
gas = pd.read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-09-17/gas_price.csv")
pop = pd.read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-09-17/state_pop.csv")
In [47]:
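# Data cleaning

# Drop the "Total" summary rows, then convert the year column from strings to integers.
# .copy() avoids a SettingWithCopyWarning when the column is reassigned.
parks = parks[~parks["year"].str.contains("Total", na=True)].copy()
parks["year"] = parks["year"].astype(int)

# Site coordinates, used for the map later on
locations = pd.read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-09-17/locations.csv")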
In [48]:
parks.year.min()
Out[48]:
1904
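
Since the three datasets cover different time spans, it can help to check which years they share before merging. The following is a minimal sketch of that check; the helper name `common_year_range` and the toy tables are hypothetical stand-ins, not the actual contents of the CSV files.

```python
import pandas as pd

def common_year_range(*frames, year_col="year"):
    """Return the (first, last) year that appears in every frame."""
    shared = set(frames[0][year_col])
    for frame in frames[1:]:
        shared &= set(frame[year_col])
    if not shared:
        raise ValueError("the frames have no years in common")
    return min(shared), max(shared)

# Toy stand-ins (made-up rows) for the parks, gas, and population tables
parks_demo = pd.DataFrame({"year": [1904, 1929, 1930, 2015, 2016]})
gas_demo = pd.DataFrame({"year": [1929, 1930, 2015]})
pop_demo = pd.DataFrame({"year": [1900, 1929, 1930, 2015]})

first, last = common_year_range(parks_demo, gas_demo, pop_demo)
print(first, last)
```

An inner merge on `year` restricts the data to these shared years automatically; the explicit check only makes the resulting range visible up front.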

How popular have the parks been over time?

In [60]:
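# Select the column with ["pop"]: attribute access (.pop) is ambiguous because DataFrame.pop is also a method
popyear = pd.DataFrame(pop.groupby(by="year")["pop"].sum()).reset_index()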
df = pd.DataFrame(parks.groupby(by="year").visitors.sum()).reset_index()
popyear = popyear.merge(df, how="inner", on="year")
popyear = popyear.rename(columns={"year": "Year"})
popyear["pop"] = popyear["pop"]/1000000
popyear["visitors"] = popyear["visitors"]/1000000
popyear.head(3)
Out[60]:
Year pop visitors
0 1904 82.165 0.120690
1 1905 83.818 0.140954
2 1906 85.439 0.030569
In [62]:
plt.figure(figsize=(10,5))
sns.set_style("whitegrid")

# sns.scatterplot returns a Matplotlib Axes, so name it accordingly
ax = sns.scatterplot(data=popyear, x='pop', y='visitors', hue='Year', palette="winter")

plt.title('Visitors to National Parks vs. US Population (1904-2015)', pad=10)
plt.xlabel('US Population (Millions)', labelpad=10)
plt.ylabel('Visitors to National Parks (Millions)', labelpad=10)

#plt.savefig("fig5_pop.jpeg")

# Render the figure explicitly instead of using print() to hide the cell output
plt.show()

Figure 1: Scatterplot showing the relationship between the US population and visitors to national parks each year. The two variables are positively related, and both increased from the 1900s to the 2000s. The graph follows an S-shaped curve: the positive relationship seems to end in the 1980s (a population of roughly 230 to 250 million people), after which population growth has little apparent effect on the number of visitors.

Discussion: As the US population increased over the years, visits to national parks also became more common. National park visits are more popular than ever, and we are interested in exploring how that has changed over time by location. It is interesting that since the 1980s the number of visitors to parks has stayed roughly constant. A possible explanation is that parks limit the number of people allowed in at one time and are reaching these capacities almost every year. If so, the number of visitors should be able to increase as more parks are added to the National Park Service.
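
One way to check a plateau like the one described above is to average yearly visitors by decade and look at the change from one decade to the next. This is only a sketch: the helper `decade_means` is hypothetical, and the numbers in `demo` are made up to mimic the shape of `popyear` (visitors in millions), not taken from the data.

```python
import pandas as pd

def decade_means(df, year_col="Year", value_col="visitors"):
    """Average a yearly column within each decade (1970, 1980, ...)."""
    decade = (df[year_col] // 10) * 10
    return df.groupby(decade)[value_col].mean()

# Made-up values with the same columns as popyear
demo = pd.DataFrame({
    "Year": [1978, 1979, 1988, 1989, 1998, 1999],
    "visitors": [200.0, 220.0, 270.0, 280.0, 275.0, 285.0],
})

by_decade = decade_means(demo)
print(by_decade)
# Relative change between consecutive decades; values near zero indicate a plateau
print(by_decade.pct_change())
```

Applied to the real `popyear` table, small decade-over-decade changes after the 1980s would support the plateau reading of Figure 1.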


Let's look at the most popular parks

In [49]:
# Finding the top five most visited US national sites of all time

ranks = pd.DataFrame(parks.groupby(by=["unit_name","region"]).visitors.sum()).reset_index()
ranks = ranks.sort_values(by="visitors", ascending=False)
partial = ranks[["unit_name", "visitors"]].head(5)
partial
Out[49]:
unit_name visitors
33 Blue Ridge Parkway 871922828.0
152 Golden Gate National Recreation Area 611031225.0
162 Great Smoky Mountains National Park 521947058.0
253 Natchez Trace Parkway 443145232.0
213 Lake Mead National Recreation Area 411700377.0
In [50]:
# Separating the data by park and by year, now with the all-time visitor count

bypark = parks[["unit_name","year","visitors"]]
bypark = bypark.merge(partial, how="right", on="unit_name")
bypark = bypark.rename(columns={"visitors_x":"Visitors", "visitors_y":"Visitors (All-Time)","unit_name":"Park Name"})
bypark = bypark.sort_values(by="year")
bypark.head(3)
Out[50]:
Park Name year Visitors Visitors (All-Time)
120 Great Smoky Mountains National Park 1931 154000.0 521947058.0
205 Great Smoky Mountains National Park 1932 300000.0 521947058.0
204 Great Smoky Mountains National Park 1933 375000.0 521947058.0
In [51]:
fig = px.line(bypark, x='year', y='Visitors', 
              title='Number of Visitors Per Year at the Five Most Visited US National Parks (1931-2016)',color="Park Name")

fig.update_xaxes(rangeslider_visible=True)
fig.show(renderer='notebook')

fig.write_html("fig1_line.html")

Figure 2: This graph displays visitor trends from 1931 to 2016 at the five most visited US national parks, determined by the total number of visitors to each park since its inception. All five sites show a clear increase in visitors over time. The sliding time scale at the bottom can be adjusted to reveal details in year-to-year data, and the park scope can be limited by deselecting park names.

Discussion: We selected the top 5 most visited sites of all time in the United States to examine their visitor trends. Some of the sites have been national sites since early in this data record, such as the Great Smoky Mountains, which was authorized as a national park in 1926 and formally established in 1934. On the other hand, construction of the Golden Gate Bridge began in 1933, and the Golden Gate National Recreation Area was created in 1972, so data on visitors to this area were not recorded until 1972. Even with these disparities, Golden Gate National Recreation Area is one of the most visited sites in the US. Looking only at the 1970s to 1990s, when the Golden Gate National Recreation Area was first established as a national site, the number of visits to this park spiked, while visits to the other parks grew slowly and steadily. Blue Ridge Parkway, which runs through Virginia and North Carolina, is a road with stunning views, making it a highly accessible national site; this is a possible reason that it has a consistently high number of visitors every year. The drastic decrease in visitors at Natchez Trace Parkway in the late 1980s coincides with reconstruction work during this time, which must have limited the park's visitors.
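
Because the parks entered the record at different times, all-time totals are not directly comparable. A quick way to see each site's coverage is to list its first and last recorded year. The sketch below assumes the `unit_name` and `year` columns of the parks table; the helper `record_span` and the toy rows are hypothetical.

```python
import pandas as pd

def record_span(df):
    """First year, last year, and number of recorded years for each site."""
    span = df.groupby("unit_name")["year"].agg(
        first_year="min", last_year="max", years_recorded="nunique"
    )
    return span.sort_values("first_year")

# Toy rows (made-up visitor counts) with the same columns as the parks table
demo = pd.DataFrame({
    "unit_name": [
        "Great Smoky Mountains National Park",
        "Great Smoky Mountains National Park",
        "Golden Gate National Recreation Area",
        "Golden Gate National Recreation Area",
    ],
    "year": [1931, 1932, 1972, 1973],
    "visitors": [1.0, 2.0, 3.0, 4.0],
})

span = record_span(demo)
print(span)
```

Dividing each site's all-time total by `years_recorded` would give an average per recorded year, which is one possible way to compare sites with records of different lengths.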


What do visits look like across the US?

In [52]:
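import descartes
import geopandas as geo
from shapely.geometry import Point, Polygon

%matplotlib inline
# Note: geo.datasets was deprecated in GeoPandas 0.14 and removed in 1.0, so this
# line requires an older GeoPandas. `earth` is not used by the Plotly map below.
earth = geo.read_file(geo.datasets.get_path('naturalearth_lowres'))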
In [53]:
park = parks[['year','gnis_id','parkname','region','state','unit_name','unit_type','visitors']]
# Attach coordinates to every visitor record
parks = park.merge(locations, on='gnis_id', how='left')
# Drop the row(s) at the maximum longitude; the "Total" rows were already removed during data cleaning
parks = parks[parks['lon'] != parks['lon'].max()]
parks = parks.sort_values(by='year')
In [54]:
fig = px.scatter_mapbox(parks, lon="lon", lat="lat", animation_frame="year",
                        color="visitors", hover_name="unit_name", hover_data=["visitors"],
                        zoom=2)

fig["layout"].pop("updatemenus") # optional, drop animation buttons
fig.update_layout(mapbox_style="open-street-map", 
                  margin={"r":0,"t":60,"l":0,"b":0},
                  title_text='Number of Visitors at US National Sites from 1929 to 2016')

fig.show(renderer='notebook')

fig.write_html("fig2_map.html")

Figure 3: This graph shows the location of the US national sites and the number of visitors to each site by year. At the bottom of the graph, there is an interactive slider to change the year. This data was obtained through data.world and collected by the National Park Service. The number of visitors per year corresponds to the color scale on the right of the graph, which adjusts each year to that year's visitor counts. Hovering over a site reveals its name, the year, its latitude and longitude, and the number of visitors that year. Most sites have similar visitor counts, with a few outliers among the most popular parks and historical sites, which are visited significantly more than the other locations.

Discussion: In this graph, we wanted to depict the location of each park as well as its total number of visitors for each year to observe any spatial patterns. It is interesting to see how the number of recognized parks and sites increases from the first year of data (1929) to the last year of data collection (2016). Most of the parks have similar numbers of visitors, but a handful of national parks and sites have significantly more. If you hover over the parks, you can see which ones are the most visited for that year. Compared with Figure 2, a pattern emerges: the most visited parks are consistently the ones that stand out in yellow (the color representing the highest number of visitors). It was interesting to see Blue Ridge Parkway as a high visitor attraction because we had never heard of it, but on closer inspection, it is clear that its views and vistas are some of the most gorgeous in the United States. There are no significant spatial trends in park visits; the most popular sites are scattered throughout the country. It is also interesting to watch the number of parks grow every year. Thinking back to Figure 1, the increase in park visitors could also have to do with the increase in parks; on the other hand, an increase in demand could have led to the designation of new parks.
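
The growth in the number of sites mentioned above can also be counted directly rather than read off the animation. The following sketch assumes the `year`, `unit_name`, and `visitors` columns of the parks table; the helper `sites_per_year` and the toy rows are hypothetical.

```python
import pandas as pd

def sites_per_year(df):
    """Count the distinct sites that have a recorded visitor figure in each year."""
    recorded = df[df["visitors"].notna()]
    return recorded.groupby("year")["unit_name"].nunique()

# Toy rows (made-up names and counts) with the same columns as the parks table
demo = pd.DataFrame({
    "year": [1929, 1929, 1972, 1972, 1972, 1972],
    "unit_name": ["Park A", "Park B", "Park A", "Park B", "Park C", "Park D"],
    "visitors": [1000.0, 2000.0, 5000.0, None, 9000.0, 4000.0],
})

counts = sites_per_year(demo)
print(counts)
```

Plotting these yearly counts next to total visitors would be one way to examine whether visitor growth tracks the growth in the number of sites.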


How have gas prices changed?

In [55]:
# Reload the visitor data, since `parks` was overwritten by the merged map table above
parks = pd.read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-09-17/national_parks.csv")
parks = parks[~parks["year"].str.contains("Total", na=True)].copy()
parks["year"] = parks["year"].astype(int)
In [56]:
import plotly.io as pio
import plotly.express as px
In [57]:
plt.figure(figsize=(10,5))
sns.set_style("whitegrid")

# Nominal prices are as paid at the time; constant prices are adjusted for inflation
sns.lineplot(data=gas, x='year', y='gas_current', label='Current (Nominal)', color='red')
sns.lineplot(data=gas, x='year', y='gas_constant', label='Constant (Real)', color='orange')

plt.title('Average Gas Prices in the US from 1929 to 2015', pad=10)
plt.xlabel('Year', labelpad=10)
plt.ylabel('Price (US Dollars)', labelpad=10)

# Save before showing, then render explicitly instead of using print() to hide the cell output
plt.savefig("fig3_gas.jpeg")
plt.show()

Figure 4: This graph shows the average price of gas in US dollars from 1929 to 2015. The current average price (in red) represents the average gas price at that time, not adjusted for inflation. The constant average price (in orange) is adjusted for inflation and represents the gas price in 2015 dollars. While the current price increases over time, the constant price remains steadier. This data was obtained from the Office of Energy Efficiency & Renewable Energy via Energy.gov.

Discussion: We wanted to explore gas price trends in the US since its national parks are largely accessible by road (car, bus, RV). Although inflation has caused the current (nominal) price of gas to increase significantly over the past century, the constant (adjusted) price has stayed around $2.50, as shown in the graph. Year-to-year constant prices have fluctuated, particularly around the Great Recession in 2008, but the average has remained roughly the same. Having visualized these yearly gas price fluctuations, we are interested in seeing how they relate to visits to national parks.
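
The difference between the current and constant series comes down to a price-index adjustment: a nominal price is scaled by the ratio of the base-year index to the index of the year in question. The sketch below illustrates that arithmetic only. The `price_index` values and the helper `to_constant_dollars` are made up for illustration; the dataset already ships its own `gas_constant` column, and the deflator used to build it is not shown in this notebook.

```python
import pandas as pd

def to_constant_dollars(nominal, price_index, base_index):
    """Convert nominal prices to base-year dollars: nominal * base_index / price_index."""
    return nominal * base_index / price_index

# Made-up prices and index values, chosen only to make the arithmetic easy to follow
demo = pd.DataFrame({
    "year": [1950, 1980, 2015],
    "gas_current": [0.50, 1.00, 2.50],
    "price_index": [20.0, 50.0, 100.0],
})

# Use the last year as the base year, mirroring the "2015 dollars" convention
base = demo.loc[demo["year"] == 2015, "price_index"].iloc[0]
demo["gas_constant_demo"] = to_constant_dollars(
    demo["gas_current"], demo["price_index"], base
)
print(demo)
```

In this toy table the nominal price rises fivefold while the adjusted price stays between $2.00 and $2.50, which is the same qualitative pattern as in Figure 4.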


Can gas prices tell us something about park visits?

In [58]:
# Total US population per year
poptot = pop.groupby(by="year")["pop"].sum().reset_index()

# Total visitors per year, joined with gas prices and population.
# Inner merges keep only the years present in all three datasets.
totals = parks.groupby(by="year")["visitors"].sum().reset_index()
totals = totals.merge(gas, how="inner", on="year")
totals = totals.merge(poptot, how="inner", on="year")
totals["Visits per Capita"] = totals["visitors"] / totals["pop"]
totals = totals.rename(columns={"gas_constant": "Constant Gas Price ($ per gallon)", "year": "Year"})
In [59]:
# Set the default template before building the figure so that it takes effect
pio.templates.default = "plotly_white"

fig = px.scatter(totals, x="Constant Gas Price ($ per gallon)", y="Visits per Capita", color="Year",
                 title="Gas Prices vs. Number of Visits to US National Sites per Capita from 1929 to 2015",
                 color_continuous_scale=["#5CDA56", "#4DC8C3", "#322FCB", "#B838F4", "#F79DDC"])

fig.show(renderer='notebook')

fig.write_html("fig4_scatter.html")

Figure 5: This scatterplot displays the relationship between real gas prices in USD and the number of visits per capita each year to national parks, monuments, and sites in the US from 1929 to 2015. Park visits increased steadily until the 1980s, after which they plateaued at roughly 1 visit per capita. It appears that gas prices and park visits were negatively related approximately until the 1970s; however, in more recent years, there has been no visible relationship between the two variables. The year is encoded by color, and the plot contains additional information in the hover feature.

Discussion: There was a visible negative correlation between gas prices and visits per capita until the 1970s. We had assumed this relationship would hold in every period, since more people are likely to drive (and visit national parks) when gas is less expensive. Interestingly, since the 1970s gas prices appear to have had little effect on park visits. Even as adjusted gas prices ranged from roughly $1.50 to over $3.50 per gallon from the 1980s onward, park visits have consistently stayed at about 1 visit per capita each year. We suspect that, in the most recent years, part of this has to do with the increased popularity of electric vehicles, which make travel to national parks less dependent on fluctuating gas prices. Overall, the data suggest that trips to national parks are no longer contingent on gas prices.
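
The visual impression of a relationship that holds early and then disappears can be quantified by computing the Pearson correlation separately for each period. This is a sketch under assumptions: the split year is a free parameter rather than something the data dictates, the helper `correlation_by_period` is hypothetical, and the rows in `demo` are made up to mimic the columns of `totals`.

```python
import pandas as pd

def correlation_by_period(df, split_year, x, y, year_col="Year"):
    """Pearson correlation between x and y before and from split_year onward."""
    early = df[df[year_col] < split_year]
    late = df[df[year_col] >= split_year]
    return {
        "before": early[x].corr(early[y]),
        "after": late[x].corr(late[y]),
    }

price = "Constant Gas Price ($ per gallon)"
visits = "Visits per Capita"

# Made-up rows: prices fall while visits rise early on, then visits stay flat
demo = pd.DataFrame({
    "Year": [1940, 1950, 1960, 1969, 1980, 1990, 2000, 2010],
    price: [3.0, 2.5, 2.0, 1.5, 1.5, 2.5, 3.5, 2.5],
    visits: [0.2, 0.4, 0.6, 0.8, 1.0, 1.1, 1.0, 0.9],
})

result = correlation_by_period(demo, split_year=1970, x=price, y=visits)
print(result)
```

On the real `totals` table, a strongly negative "before" value next to an "after" value close to zero would match the pattern described for Figure 5. Correlation alone would not show that gas prices caused the early trend, since both series also change with time.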


Thanks for reading!

Citations:

Dataset Title: US National Parks Visitation (1904-2016)
Author/Publisher: Tidy Data Repository
Data Provided By: National Park Service
Date Published: 2017
URL: https://github.com/rfordatascience/tidytuesday/tree/master/data/2019/2019-09-17
URL: https://data.world/inform8n/us-national-parks-visitation-1904-2016-with-boundaries/activity

Dataset Title: State Population
Author/Publisher: Wikipedia
Data Provided By: US Census Bureau
Date Published: 2022
URL: https://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_historical_population

Dataset Title: Average Historical Annual Gasoline Pump Price, 1929-2015
Author/Publisher: Energy Information Administration
Data Provided By: U.S. Department of Commerce, Bureau of Economic Analysis
Date Published: 2022
URL: https://www.energy.gov/eere/vehicles/fact-915-march-7-2016-average-historical-annual-gasoline-pump-price-1929-2015
